Methods and systems for detecting deepfakes

ABSTRACT

A system for detecting synthetic videos may include a server, a plurality of weak classifiers, and a strong classifier. The server may be configured to receive a prediction result from each of a plurality of weak classifiers; and send the prediction results from each of the plurality of weak classifiers to a strong classifier. The weak classifiers may be trained on real videos and known synthetic videos to analyze a distinct characteristic of a video file; detect irregularities of the distinct characteristic; generate a prediction result associated with the distinct characteristic, the prediction result being a prediction on whether the video file is synthetic; and output the prediction result to the server. The strong classifier may be trained to receive the prediction results of the plurality of weak classifiers from the server; analyze the prediction results; and determine if the video file is synthetic based on the prediction results.

BACKGROUND

As machine learning techniques become more mainstream and accessible to the public, the possibility of negative consequences arises. One such example is the phenomenon of deepfakes. The term “deepfake” comes from the combination of “deep learning” and “fake”, as it uses deep learning techniques and neural networks to create fake content, replacing the face of a person in an image or video with the face of another person. In some cases, audio may be added and the person may be made to look like they are saying things that they never said. Early in their lifetime, deepfakes were primarily used on various social media networks to create funny videos or fake pornographic videos.

However, due to the relatively low cost and technical ability required to create a deepfake, their prominence has grown tremendously, as has their quality. This threatens to cause serious harm in politics and other areas as well. The possibility of fake videos on the internet showing candidates for government office saying harmful things, or the possibility of using fake videos to blackmail others, has created a significant problem for social media companies, especially in the wake of elections.

Many social media companies are motivated to prevent, detect, respond to, and recover from security threats as they manifest on their respective social media platforms. It is in their best interest to be able to detect deepfake videos and block them from appearing on their platforms. However, deepfake detection has stumped nearly everyone in the industry since deepfakes first arrived.

Deepfakes are typically generated with various types of convolutional neural networks, due to their proclivity to work well with images. These networks can include generative adversarial networks (GANs), various autoencoders, or a combination of the two. Autoencoders, in relation to deepfakes, involve training a network to recreate a specific image, then using the network to recreate a separate image based on the recreation methods learned for the original image. GANs involve two separate neural networks that compete against each other, and this contributes to the difficulty of detecting deepfakes. One of the networks (the generator) tries to create fake videos or images that will trick the other network (the discriminator), while the discriminator tries to detect fake videos or images created by the generator. Because of this, both networks learn and improve. So while deepfake detection may theoretically improve in increments at times, this may cause deepfake generation to improve as well.

Various attempts have been made at developing a robust deepfake detection methodology, yet few, if any, have been deemed a success. To provide context, in 2018, 902 papers on GANs were uploaded to arXiv. In contrast, during that same time period, only 25 papers on deep learning methods for detecting tampered and synthetic imagery were published, including non-peer-reviewed papers.

Some previously explored detection methodologies include signal-level detection (sensor noise, CFA interpolation, double JPEG compression, etc.), physical-level detection (lighting conditions, shadows, reflections, etc.), semantic-level detection (consistency of metadata), and some physiological signal detection, such as breathing and blinking.

Nearly all attempted techniques for deepfake detection involve a strong classifier: a network or classifier that is trained to be well-correlated with true classification, meaning the network is trained to learn to directly classify. In the case of deepfakes, this would involve a single classifier that attempts to classify a video or image as “real” or “fake”. This is in contrast to a weak classifier, or weakly supervised learning. A weak classifier is a neural network trained to analyze and detect certain characteristics. However, these characteristics and the resulting determinations, by themselves, may only loosely predict true classification. In some cases, a weak classifier may be only slightly better at predicting a classification than a random choice.

SUMMARY OF THE INVENTION

Embodiments of the present disclosure relate to systems and methods of detecting synthetic videos. According to one aspect of the present disclosure, a system for detecting synthetic videos may include a server, a plurality of weak classifiers, and a strong classifier. The server may be configured to receive a prediction result from each of a plurality of weak classifiers; and send the prediction results from each of the plurality of weak classifiers to a strong classifier. The plurality of weak classifiers may be trained on real videos and known synthetic videos to analyze a distinct characteristic of a video file; detect irregularities of the distinct characteristic; generate a prediction result associated with the distinct characteristic, wherein the prediction result is a prediction on whether the video file is synthetic; and output the prediction result to the server. The strong classifier may be trained to receive the prediction result of each of the plurality of weak classifiers from the server; analyze the prediction result from each of the plurality of weak classifiers; and determine if the video file is synthetic based on the prediction results.

In some embodiments, the prediction result may be a numerical confidence level. In some embodiments, the known synthetic videos used to train the plurality of weak classifiers have had blurry frames removed. In some embodiments, a first weak classifier may be trained to detect a mouth in the video file; extract the mouth from the video file; detect irregularities of the mouth, wherein irregularities may be associated with teeth or facial hair; and generate a first prediction result based on the irregularities. In some embodiments, extracting the mouth from the video file may include down-sampling the mouth.

In some embodiments, extracting the mouth from the video file may include extracting a pre-defined number of frames from the video file in which a mouth has been detected; extracting a mouth from each extracted frame; generating the first prediction result based on the mouth in each extracted frame; and generating an average prediction result from the generated first prediction results. In some embodiments, a second weak classifier may be trained to detect movement of a head in the video file; calculate a pulse based on the detected movement; detect irregularities of the pulse; and generate a second prediction result associated with the irregularities of the pulse. In some embodiments, calculating the pulse based on the detected movement may include identifying a plurality of features of the head; decomposing trajectories of each feature into a set of component motions; determining a component that best corresponds to a heartbeat based on its temporal frequency; identifying peaks associated with the determined component; and calculating the pulse based on the peaks. In some embodiments, a third weak classifier may be trained to detect irregularities in audio gain of the video file.

According to another aspect of the present disclosure, a method for detecting synthetic videos may include monitoring, by a server, a digital media source; identifying, by the server, a video on the digital media source; extracting, by the server, the video; analyzing, by the server, the video with a plurality of weak classifiers; and analyzing, by the server, the prediction result of each of the plurality of weak classifiers with a strong classifier. Each weak classifier may be trained to detect irregularities of a distinct characteristic of the video; and generate a prediction result associated with the distinct characteristic, wherein the prediction result is a prediction on whether the video is synthetic. The strong classifier may be trained to determine if the video is synthetic based on the prediction results generated by each of the weak classifiers.

In some embodiments, each weak classifier of the plurality of weak classifiers may be trained on real videos and known synthetic videos. In some embodiments, the known synthetic videos used to train the plurality of weak classifiers may have had blurry frames removed. In some embodiments, a first weak classifier may be trained to detect a mouth in the video; extract the mouth from the video; detect irregularities of the mouth, wherein irregularities may be associated with teeth or facial hair; and generate a first prediction result based on the irregularities. In some embodiments, extracting the mouth from the video may include down-sampling the mouth. In some embodiments, extracting the mouth from the video may include extracting a pre-defined number of frames from the video in which a mouth has been detected; extracting a mouth from each extracted frame; generating the first prediction result based on the mouth in each extracted frame; and generating an average prediction result from the generated first prediction results.

In some embodiments, a second weak classifier may be trained to detect movement of a head in the video; calculate a pulse based on the detected movement; detect irregularities of the pulse; and generate a second prediction result associated with the irregularities of the pulse. In some embodiments, calculating the pulse based on the detected movement may include identifying a plurality of features of the head; decomposing trajectories of each feature into a set of component motions; determining a component that best corresponds to a heartbeat based on its temporal frequency; identifying peaks associated with the determined component; and calculating the pulse based on the peaks. In some embodiments, a third weak classifier may be trained to detect irregularities in audio gain of the video. In some embodiments, the prediction result may be a numerical confidence level.

According to another aspect of the present disclosure, a method of training a set of weak classifiers to detect synthetic videos may include creating, by one or more processors, a training set, wherein the training set comprises a first plurality of videos known to be real and a second plurality of videos known to be synthetic; removing, by the one or more processors, blurred frames from each video in the training set known to be synthetic; identifying, by the one or more processors, a physical feature in each video of the training set; extracting, by the one or more processors, the physical feature from each video of the training set; down-sampling, by the one or more processors, the extracted physical feature in at least one frame; and training, by the one or more processors, a first weak classifier to, based on the physical features and the down-sampled physical features, predict whether a test video is synthetic. The method may not include up-sampling.

BRIEF DESCRIPTION OF THE DRAWINGS

Various objectives, features, and advantages of the disclosed subject matter can be more fully appreciated with reference to the following detailed description of the disclosed subject matter when considered in connection with the following drawings, in which like reference numerals identify like elements.

The drawings are not necessarily to scale, or inclusive of all elements of a system, emphasis instead generally being placed upon illustrating the concepts, structures, and techniques sought to be protected herein.

FIG. 1 is a visualization of example processing that may occur to detect a deepfake, according to some embodiments of the present disclosure.

FIG. 2 is a block diagram of an example system for detecting a deepfake, according to some embodiments of the present disclosure.

FIG. 3 is a flowchart showing a process for identifying synthetic videos that may occur within FIGS. 1 and 2, according to some embodiments of the present disclosure.

FIG. 4 is a flowchart showing a process for training a set of weak classifiers to identify synthetic videos, according to some embodiments of the present disclosure.

FIGS. 5A-5B show examples of possible irregularities in teeth generated by deepfakes.

FIG. 6 is a diagram of an illustrative server device that can be used within the system of FIG. 1 or 2, according to some embodiments of the present disclosure.

DETAILED DESCRIPTION OF SEVERAL EMBODIMENTS

The following detailed description is merely exemplary in nature and is not intended to limit the invention or the applications of its use.

Embodiments of the present disclosure relate to systems and methods for detecting deepfakes using a series of weak classifiers (i.e. weak detectors, weakly supervised learning models, weak learners, weak predictors, etc.). This method may utilize an approach the same as or similar to ensemble learning. In some embodiments, the plurality of weak classifiers may be weighted to make a final classification of interest; in relation to the present disclosure, the final classification may be whether an input video is “real” or “fake”. In some embodiments, a strong classifier may use the weak classifiers' classifications as inputs. Each weak classifier may employ weakly supervised learning models trained to detect various characteristics of an image or video. These weak classifiers may include, but are not limited to, a mouth detector, an eye detector, a skin detector, a pulse detector, and an audio detector. In some embodiments, there may be multiple audio detectors that are trained to analyze different aspects of the audio track (e.g. the gain). In some embodiments, a strong classifier (i.e. strong detector, strongly supervised learning model, strong learner, etc.) may use the outputs of the series of weak classifiers to determine if a video or image is real or fake. In the context of the present disclosure, the term deepfake is taken to mean any video with synthetic human appearances, sounds, or images made to look real. Examples may include, but are not limited to, face swaps and videos where a person is made to look like they are saying something fake. Deepfakes will be referred to herein as synthetic videos.

FIG. 1 is a visualization of the process 100 that may occur to detect a deepfake, according to some embodiments of the present disclosure. Image 102 may be a frame of a video, or may generally represent a video stream or file. Image 102 proceeds to be analyzed by a plurality of weak classifiers 104-114. A weak classifier may also be referred to as a sub-classifier. Each weak classifier 104-114 may be trained to detect separate characteristics of image 102. In some embodiments, each weak classifier 104-114 may be trained with a database of videos known to be deepfakes and videos known to be real. In some embodiments, videos used to train the weak classifiers may be obtained from YouTube or other online video sources. In some embodiments, the weak classifiers 104-114 may be trained with a video known to be a deepfake and the original “real” version of the deepfake. In some embodiments, each weak classifier 104-114 may be trained with the same dataset, but may be trained to focus on specific characteristics of the image 102. In some embodiments, the weak classifiers 104-114 may be trained only on clear frames (not blurry frames) from known deepfake videos. This may prevent the weak classifiers 104-114 from associating low-resolution frames or blurred frames with deepfakes. Blurry frames may be removed from a video stream manually or via standard blur removal algorithms/techniques. In some embodiments, the weak classifiers 104-114 may be trained using down-sampled characteristics of a face. In some embodiments, up-sampled frames may be removed from the training database; this may preserve “original” information and help improve the accuracy of the weak classifiers. In some embodiments, at least one of the plurality of weak classifiers 104-114 may employ a detection method that includes extracting the first 100 frames with a face and averaging all predictions over those frames, as illustrated in the sketch below. The detection method may extract the first 200, or 50, or any pre-determined number of frames with a face.
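
The frame-sampling strategy above can be illustrated in code. The following is a minimal sketch, assuming OpenCV for face detection and a hypothetical per-frame model predict_frame() that returns a confidence (a decimal between 0 and 1) that a frame is synthetic; neither the library nor the interface is prescribed by the present disclosure.

# Minimal sketch: average per-frame predictions over the first N frames
# that contain a face. Assumes OpenCV and a hypothetical predict_frame()
# that returns a confidence in [0, 1] that the frame is synthetic.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def average_prediction(video_path, predict_frame, max_frames=100):
    cap = cv2.VideoCapture(video_path)
    scores = []
    while len(scores) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break  # end of video reached before max_frames
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, 1.1, 5)
        if len(faces) > 0:  # only use frames in which a face was detected
            scores.append(predict_frame(frame))
    cap.release()
    return sum(scores) / len(scores) if scores else None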

In some embodiments, the determinations from each weak classifier 104-114 in relation to their respective characteristics may be passed as inputs to the strong classifier 116. In some embodiments, the determination or prediction of each weak classifier 104-114 may be a confidence level (e.g. a decimal between 0 and 1). Strong classifier 116 may be trained to predict a final classification on the authenticity of the input image 102 based on the determinations of each weak classifier. In some embodiments, the possible classifications for strong classifier 116 are “real” or “fake”. In some embodiments, the strong classifier 116 may predict a confidence level that a video is fake, e.g. a decimal between 0 and 1. In some embodiments, strong classifier 116 may be a neural network trained with sets of videos consisting of known deepfake videos and known real videos to predict, based on the inputs/determinations from each weak classifier 104-114, whether the input image 102 is real or fake. In some embodiments, the strong classifier 116 may weigh or adaptively weigh the determinations from each weak classifier 104-114 in order to learn to predict whether a video is real or fake.
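
As one possible (non-limiting) realization of this weighting, the strong classifier could be any model that learns weights over the weak classifiers' confidence levels. The sketch below uses scikit-learn's logistic regression purely for illustration; the toy data and the choice of model are assumptions, and the disclosure equally contemplates a neural network.

# Sketch of a strong classifier that learns weights over weak-classifier
# confidences. Logistic regression is one simple choice; a neural network
# is another. Assumes scikit-learn and NumPy.
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per training video, one column per weak classifier
# (e.g. mouth, pulse, eyes, skin, audio); y: 0 = real, 1 = fake.
X = np.array([[0.9, 0.7, 0.8, 0.6, 0.9],
              [0.1, 0.3, 0.2, 0.4, 0.2]])  # toy data for illustration only
y = np.array([1, 0])

strong = LogisticRegression().fit(X, y)
confidence_fake = strong.predict_proba(X)[:, 1]  # decimal between 0 and 1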

In some embodiments, weak classifier 104 may be trained to analyze the mouth of input image 102. By definition, deepfakes alter the target's mouth. If a deepfake is generated to make a person look like they are saying something fake, or something they have never said themselves, this will inherently change the mouth of the person in the video. Facial hair, particularly immediately around the mouth, may not transfer well from a source video to a target video (from source to deepfake). Thus, weak classifier 104 may be trained to detect irregularities in the facial hair surrounding mouths. As described earlier, weak classifier 104 may be trained on a set of videos comprising videos known to be deepfakes and videos known to be genuine. In some embodiments, weak classifier 104 may be trained to detect irregularities in the teeth of the face of input image 102. In many deepfake generation techniques, similar to facial hair surrounding a mouth, the teeth of the target may not transfer well from a source video to an end-product deepfake. Thus, weak classifier 104 may be trained to detect irregularities in the teeth of the input image 102. Examples of teeth in a source video and corresponding deepfake are shown in FIGS. 5A-5B. Weak classifier 104 may output whether or not irregularities of the mouth were detected to strong classifier 116.
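
One way (among many) weak classifier 104 might isolate the mouth region before looking for such irregularities is with facial landmarks. The sketch below assumes dlib and its publicly available 68-point shape predictor, in which landmark points 48-67 cover the mouth; the margin value is an illustrative assumption.

# Sketch: crop the mouth region using dlib's 68-point facial landmarks
# (points 48-67 are the mouth). Assumes dlib and the standard
# shape_predictor_68_face_landmarks.dat model file are available.
import dlib
import numpy as np

face_detector = dlib.get_frontal_face_detector()
landmarks = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_mouth(frame_gray, margin=10):
    faces = face_detector(frame_gray)
    if not faces:
        return None  # no face, so no mouth to extract
    shape = landmarks(frame_gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y)
                    for i in range(48, 68)])  # mouth landmark points
    x0, y0 = pts.min(axis=0) - margin
    x1, y1 = pts.max(axis=0) + margin
    return frame_gray[max(y0, 0):y1, max(x0, 0):x1]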

In some embodiments, weak classifier 106 may be trained to detect irregularities in pulse. In some embodiments, pulse detection may be performed according to techniques disclosed in “Detecting Pulse from Head Motions in Video,” Balakrishnan et al., 2013, which is herein incorporated by reference in its entirety. Balakrishnan teaches a method for measuring the pulse of an individual in a video; due to the Newtonian reaction to the influx of blood at each beat, subtle motions of the head occur. The technique detects motion of the head, then decomposes the trajectories of features of the head into a set of component motions, chooses the component that best corresponds to heartbeats based on its temporal frequency, and identifies the peaks of this motion, wherein the peaks correspond to beats. These techniques may be incorporated into weak classifier 106; weak classifier 106 may be trained to detect and, in some embodiments, calculate a pulse for a person in input image 102. Weak classifier 106 may be trained to detect irregularities based on these determined pulses, and may output whether or not irregularities of the pulse were detected to strong classifier 116.
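
A rough sketch of this pipeline is shown below: decompose tracked head trajectories into components, select the component whose dominant frequency falls within a plausible heart-rate band, and count its peaks. The 0.75-2.0 Hz band and the use of PCA via SVD are illustrative assumptions, not the exact formulation of Balakrishnan et al.

# Sketch of pulse estimation from head motion, loosely following
# Balakrishnan et al. (2013). Assumes NumPy and SciPy; band and
# parameters are illustrative assumptions.
import numpy as np
from scipy.signal import find_peaks

def estimate_pulse(trajectories, fps, band=(0.75, 2.0)):
    """trajectories: array (n_frames, n_features) of vertical positions
    of tracked head points (e.g., from cv2.calcOpticalFlowPyrLK)."""
    X = trajectories - trajectories.mean(axis=0)
    # PCA via SVD: rows of vt are component directions.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    components = X @ vt.T
    freqs = np.fft.rfftfreq(len(X), d=1.0 / fps)
    best, best_power = None, -1.0
    for c in components.T:
        spectrum = np.abs(np.fft.rfft(c)) ** 2
        in_band = (freqs >= band[0]) & (freqs <= band[1])
        power = spectrum[in_band].max() if in_band.any() else 0.0
        if power > best_power:  # component best matching a heartbeat
            best, best_power = c, power
    peaks, _ = find_peaks(best, distance=fps / band[1])
    return len(peaks) * 60.0 * fps / len(X)  # beats per minute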

In some embodiments, weak classifiers 108-114 may each be trained to analyze different characteristics of a video and detect irregularities. Process 100 may include a weak classifier 108 trained to detect irregularities and make predictions based on the eyes of a face in input image 102. In some embodiments, process 100 may include a weak classifier 110 trained to detect irregularities and make predictions based on the skin of a face in input image 102. In some embodiments, process 100 may include a weak classifier 112 trained to detect irregularities and make predictions based on the audio track of input image 102. For example, weak classifier 112 may be trained to detect irregularities in the gain. In some embodiments, the background audio may be stripped to isolate the voice, and weak classifier 112 may analyze the voice, in some cases to detect gain.
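
By way of illustration only, weak classifier 112's gain analysis might flag abrupt jumps in short-window RMS level. The sketch below assumes the audio track has already been extracted to a mono sample array (e.g., via ffmpeg); the window length and the 6 dB jump threshold are assumptions.

# Sketch: flag irregular audio gain by looking for abrupt jumps in
# short-window RMS level. Assumes a mono float sample array; the 6 dB
# jump threshold between neighboring windows is an illustrative choice.
import numpy as np

def gain_irregularity_score(samples, rate, window_s=0.05, jump_db=6.0):
    win = max(1, int(rate * window_s))
    n = len(samples) // win
    frames = samples[:n * win].reshape(n, win)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-12  # avoid log(0)
    level_db = 20 * np.log10(rms)
    jumps = np.abs(np.diff(level_db)) > jump_db
    return jumps.mean()  # fraction of abrupt gain jumps, in [0, 1]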

FIG. 2 is a block diagram of an example system 200 for detecting a deepfake, according to some embodiments of the present disclosure. In some embodiments, process 100 of FIG. 1 may be performed within the architecture of system 200. System 200 may include user devices 204a-n (204 generally) that are connected to a digital media source 202 via network 206. Server device 208 may be communicably coupled to digital media source 202 via network 206.

Digital media source 202 may be the dark web or a form of social media network, including, but not limited to, Facebook, Instagram, Twitter, Google+, YouTube, Pinterest, Tumblr, etc. User devices 204 may be configured to upload video (or other multimedia content) via network 206 to digital media source 202. Server device 208 may be configured to intercept, test, or interact with videos uploaded to the dark web or to a social media network. Server device 208 may be configured to predict whether a video attempting to be uploaded to digital media source 202 is real or fake. In some embodiments, server device 208 may include deepfake detection module 210, which may perform analysis similar to or the same as process 100 of FIG. 1.

Deepfake detection module 210 may include a plurality of weak classifiers 212 and a strong classifier 214. In some embodiments, a video uploaded to digital media source 202 by user device 204 via network 206 may be analyzed by the plurality of weak classifiers 212 in a fashion similar to or the same as process 100. Strong classifier 214 may be configured to receive determinations or results from the plurality of weak classifiers 212 and predict whether the video file is real or fake. In some embodiments, strong classifier 214 may learn weights for the determination of each weak classifier.

Device 204 may include one or more computing devices capable of receiving user input as well as transmitting and/or receiving data via network 206 or communicating with server device 208. In some embodiments, user device 204 may include a conventional computer system, such as a desktop or laptop computer. Alternatively, user device 204 may include a device having computer functionality, such as a personal digital assistant (PDA), a mobile telephone, a smartphone, or other suitable device. User device 204 may be configured to send documents via network 206 to server device 208. In some embodiments, user device 204 may also be configured to receive encrypted information and display an electronic version of the originally uploaded document with values that have been extracted from the original document.

Network 206 may include one or more wide area networks (WANs), metropolitan area networks (MANs), local area networks (LANs), personal area networks (PANs), or any combination of these networks. Network 206 may include a combination of one or more types of networks, such as Internet, intranet, Ethernet, twisted-pair, coaxial cable, fiber optic, cellular, satellite, IEEE 802.11, terrestrial, and/or other types of wired or wireless networks. Network 206 may also use standard communication technologies and/or protocols.

Server device 208 may include any combination of one or more web servers, mainframe computers, general-purpose computers, personal computers, or other types of computing devices. Server device 208 may represent distributed servers that are remotely located and communicate over a communications network, or over a dedicated network such as a local area network (LAN). Server device 208 may also include one or more back-end servers for carrying out one or more aspects of the present disclosure. In some embodiments, server device 208 may be the same as or similar to server device 600 described below in the context of FIG. 6.

FIG. 3 is a flowchart showing a process 300 for identifying synthetic videos that may occur within FIGS. 1 and 2, according to some embodiments of the present disclosure. In some embodiments, process 300 may be performed by server device 208 of system 200 in FIG. 2. At block 301, server device 208 may monitor the dark web or a social media network, such as digital media source 202. In some embodiments, the social media network may be YouTube, Twitter, Facebook, or any other similar social network. In some embodiments, server device 208 may monitor a plurality of social networks. At block 302, server device 208 may identify a post (i.e. tweet, Facebook status, etc.) that contains a video. At block 303, in response to identifying a post that contains a video, server device 208 may extract the video from the post. The extraction may be performed via any standard techniques for downloading or extracting a video from a social media post.

At block 304, server device 208 may analyze the video with a plurality of weak classifiers. In some embodiments, the analysis with each weak classifier may be performed in parallel, such as in process 100 of FIG. 1; a sketch of one such parallel dispatch follows this paragraph. Each weak classifier may be trained to detect irregularities related to a distinct characteristic of the video. For example, one of the weak classifiers may be trained to analyze and detect irregularities related to a mouth (i.e. facial hair or teeth) within the video. A weak classifier may be trained to analyze and detect irregularities related to the audio gain of the video. A weak classifier may be trained to calculate a pulse of a person in the video and detect irregularities of the pulse. A weak classifier may be trained to analyze the eyes of a face or skin in the video. Each weak classifier may be trained with a plurality of known real videos and known synthetic videos to make predictions on the authenticity of the video based on the aforementioned characteristics. Each weak classifier may output or generate a prediction result on the authenticity of the video. In some embodiments, the prediction result may be a score or numerical decimal reflecting a confidence level of authenticity. In some embodiments, the score may be between zero and one. Additional details of the training process are discussed in relation to FIG. 4.
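
A minimal sketch of such a parallel dispatch follows; the weak-classifier objects and their predict() interface are hypothetical and stand in for whatever models block 304 actually employs.

# Sketch: run the weak classifiers in parallel and collect their
# prediction scores (decimals between zero and one). The classifier
# objects and their .predict(video) interface are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def analyze_video(video, weak_classifiers):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(c.predict, video) for c in weak_classifiers]
        return [f.result() for f in futures]  # inputs to the strong classifier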

At block 305, server device 208 may analyze the results of the plurality of weak classifiers with a strong classifier. In some embodiments, the server device 208 may compile or assemble the prediction results (i.e. prediction scores/confidence levels/decimals) and feed the results/scores as inputs to a strong classifier. The strong classifier may be trained to predict whether the video is real or synthetic based on the results from the weak classifiers. In some embodiments, the strong classifier may generate a confidence level or score that reflects the likelihood the video is synthetic. In some embodiments, the score may be between zero and one. In some embodiments, the strong classifier may be trained with a training set of videos comprising a plurality of known real videos and known synthetic videos. The strong classifier may be trained to, based on the results of the weak classifiers, make a final classification on the authenticity of the video. The strong classifier may weigh, adaptively weigh, or learn to adapt weights associated with each result from the weak classifiers.

FIG. 4 is a flowchart showing a process 400 for training a set of weak classifiers to identify synthetic videos, according to some embodiments of the present disclosure. In some embodiments, the weak classifiers may be the plurality of weak classifiers 212 in system 200. In some embodiments, process 400 may be performed by server device 208 of system 200, although the process may also be performed by any other server device or computing device, not necessarily the same one that uses the trained classifiers. At block 401, the server may create a training set. In some embodiments, the training set may include a plurality of videos (e.g. at least fifty to one hundred videos). The plurality of videos may include both videos known to be authentic and videos known to be synthetic (i.e. known deepfakes). In some embodiments, the training set may include both real and fake versions of the same original clip. At block 402, the server may remove all blurred frames from the videos in the training set known to be synthetic. In some embodiments, this may reduce the likelihood of any classifiers associating blurriness with inauthenticity and increase the accuracy of the classifiers. The removal of blurred frames may be performed manually or by any number of known techniques for detecting blur in frames and removing them; one such technique is sketched below.
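
One standard blur-detection technique is thresholding the variance of the Laplacian, a common focus measure. The sketch below assumes OpenCV, and the threshold value is an illustrative assumption that would be tuned per dataset.

# Sketch: drop blurry frames using the variance-of-Laplacian focus
# measure. Assumes OpenCV; the threshold of 100.0 is an illustrative
# assumption to be tuned for the training data.
import cv2

def keep_sharp_frames(frames, threshold=100.0):
    sharp = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if cv2.Laplacian(gray, cv2.CV_64F).var() >= threshold:
            sharp.append(frame)  # frame is in focus enough to keep
    return sharp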

At block 403, the server may identify a physical feature (i.e. mouth, eyes, etc.) within each video frame of each video in the training set. This block may be performed by any number of standard object detection or image segmentation techniques. At block 404, the server may extract the identified physical feature from each frame of each video in the training set. In some embodiments, this step can be used to create multiple training sets for multiple weak classifiers (i.e. a set for a classifier to analyze a mouth and a set for a classifier to analyze eyes). At block 405, the server may down-sample the physical feature in at least one frame; a sketch of this step follows. In some embodiments, some videos may require down-sampling while others may not. In some embodiments, there may be no up-sampling performed in process 400, only down-sampling. At block 406, a weak classifier is trained to, based on both the physical features and the down-sampled physical features, predict whether a test video is synthetic. In some embodiments, the prediction may include a confidence level or score, similar to the scores discussed in relation to block 305 of FIG. 3.
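
A minimal sketch of the down-sampling in block 405, honoring the no-up-sampling constraint, might look as follows; the 64x64 target size is an illustrative assumption.

# Sketch of block 405: down-sample an extracted feature crop to a fixed
# size, but never up-sample, so no synthetic pixels are introduced.
# Assumes OpenCV; the 64x64 target size is an illustrative assumption.
import cv2

def downsample_only(crop, target=(64, 64)):
    h, w = crop.shape[:2]
    if h <= target[1] and w <= target[0]:
        return crop  # already at or below target: leave as-is, no up-sampling
    return cv2.resize(crop, target, interpolation=cv2.INTER_AREA)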

FIGS. 5A-5B show examples of possible irregularities in teeth generated by deepfakes. A face extracted from a real video is shown in the left image of FIGS. 5A and 5B. As expected in a real video of a real person, individual teeth are easily identified. The right images of FIGS. 5A and 5B are examples of a possible shortcoming of deepfake videos and may show how deepfake videos may lack in their ability to generate details of mouths; the teeth of the face in the fake video appear as just a blob, as opposed to individual teeth. In some embodiments, weak classifier 104, or any weak classifier that is trained to analyze the mouth in the context of process 100, may be trained to detect irregularities in the teeth of a face such as shown in FIGS. 5A-5B.

FIG. 6 is a diagram of an illustrative server device 600 that can be used within system 200 of FIG. 2, according to some embodiments of the present disclosure. Server device 600 may implement various features and processes as described herein. Server device 600 may be implemented on any electronic device that runs software applications derived from compiled instructions, including without limitation personal computers, servers, smart phones, media players, electronic tablets, game consoles, email devices, etc. In some implementations, server device 600 may include one or more processors 602, volatile memory 604, non-volatile memory 606, and one or more peripherals 608. These components may be interconnected by one or more computer buses 610.

Processor(s) 602 may use any known processor technology, including but not limited to graphics processors and multi-core processors. Suitable processors for the execution of a program of instructions may include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors or cores, of any kind of computer. Bus 610 may be any known internal or external bus technology, including but not limited to ISA, EISA, PCI, PCI Express, NuBus, USB, Serial ATA, or FireWire. Volatile memory 604 may include, for example, SDRAM. Processor 602 may receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer may include a processor for executing instructions and one or more memories for storing instructions and data.

Non-volatile memory 606 may include, by way of example, semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. Non-volatile memory 606 may store various computer instructions including operating system instructions 612, communication instructions 614, application instructions 616, and application data 617. Operating system instructions 612 may include instructions for implementing an operating system (e.g., Mac OS®, Windows®, or Linux). The operating system may be multi-user, multiprocessing, multitasking, multithreading, real-time, and the like. Communication instructions 614 may include network communications instructions, for example, software for implementing communication protocols, such as TCP/IP, HTTP, Ethernet, telephony, etc. Application instructions 616 can include instructions for detecting deepfake videos using a plurality of weak classifiers according to the systems and methods disclosed herein. For example, application instructions 616 may include instructions for components 212-214 described above in conjunction with FIG. 2.

Peripherals 608 may be included within server device 600 or operatively coupled to communicate with server device 600. Peripherals 608 may include, for example, network subsystem 618, input controller 620, and disk controller 622. Network subsystem 618 may include, for example, an Ethernet or WiFi adapter. Input controller 620 may be any known input device technology, including but not limited to a keyboard (including a virtual keyboard), mouse, track ball, and touch-sensitive pad or display. Disk controller 622 may include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks.

Methods described herein may represent processing that occurs within a system for detecting deepfake videos using a plurality of weak classifiers (e.g., process 100 of FIG. 1). The subject matter described herein can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a machine-readable storage device), or embodied in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices, or magnetic disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.

Although the disclosed subject matter has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the disclosed subject matter may be made without departing from the spirit and scope of the disclosed subject matter.

The invention claimed is:
 1. A system for detecting fake videos comprising: a server configured to: receive a prediction result from each of a plurality of weak classifiers; and send the prediction results from each of the plurality of weak classifiers to a strong classifier; a plurality of weak classifiers, each weak classifier being trained on real videos and known fake videos to: analyze a distinct characteristic of a video file; detect irregularities of the distinct characteristic; in response to detecting irregularities of the distinct characteristic, generate a prediction result dependent on the detected irregularities, wherein the prediction result is a prediction on whether the video file is fake; and output the prediction result to the server; and a strong classifier trained to: receive the prediction result of each of the plurality of weak classifiers from the server; analyze the prediction result from each of the plurality of weak classifiers; and determine if the video file is fake based on the prediction results.
 2. The system of claim 1, wherein the prediction result is a numerical confidence level.
 3. The system of claim 1, wherein the known fake videos used to train the plurality of weak classifiers have had blurry frames removed.
 4. The system of claim 1, wherein a first weak classifier is trained to: detect a mouth in the video file; extract the mouth from the video file; detect irregularities of the mouth, wherein irregularities may be associated with teeth or facial hair; and generate a first prediction result based on the irregularities.
 5. The system of claim 4, wherein extracting the mouth from the video file comprises down-sampling the mouth.
 6. The system of claim 4, wherein extracting the mouth from the video file comprises: extracting a pre-defined number of frames from the video file in which a mouth has been detected; extracting a mouth from each extracted frame; generating the first prediction result based on the mouth in each extracted frame; and generating an average prediction result from the generated first prediction results.
 7. The system of claim 1, wherein a second weak classifier is trained to: detect movement of a head in the video file; calculate a pulse based on the detected movement; detect irregularities of the pulse; and generate a second prediction result associated with the irregularities of the pulse.
 8. The system of claim 7, wherein calculating the pulse based on the detected movement comprises: identifying a plurality of features of the head; decomposing trajectories of each feature into a set of component motions; determining a component that best corresponds to a heartbeat based on its temporal frequency; identifying peaks associated with the determined component; and calculating the pulse based on the peaks.
 9. The system of claim 1, wherein a third weak classifier is trained to detect irregularities in audio gain of the video file.
 10. A method for detecting fake videos comprising: monitoring, by a server, a digital media source; identifying, by the server, a video on the digital media source; extracting, by the server, the video; analyzing, by the server, the video with a plurality of weak classifiers, each weak classifier being trained to: detect irregularities of a distinct characteristic of the video; and in response to detecting irregularities of the distinct characteristic, generate a prediction result dependent on the detected irregularities, wherein the prediction result is a prediction on whether the video is fake; analyzing, by the server, the prediction result of each of the plurality of weak classifiers with a strong classifier, the strong classifier being trained to determine if the video is fake based on the prediction results generated by each of the weak classifiers.
 11. The method of claim 10, wherein each weak classifier of the plurality of weak classifiers is trained on real videos and known fake videos.
 12. The method of claim 11, wherein the known fake videos used to train the plurality of weak classifiers have had blurry frames removed.
 13. The method of claim 10, wherein a first weak classifier is trained to: detect a mouth in the video; extract the mouth from the video; detect irregularities of the mouth, wherein irregularities may be associated with teeth or facial hair; and generate a first prediction result based on the irregularities.
 14. The method of claim 13, wherein extracting the mouth from the video comprises down-sampling the mouth.
 15. The method of claim 13, wherein extracting the mouth from the video comprises: extracting a pre-defined number of frames from the video in which a mouth has been detected; extracting a mouth from each extracted frame; generating the first prediction result based on the mouth in each extracted frame; and generating an average prediction result from the generated first prediction results.
 16. The method of claim 10, wherein a second weak classifier is trained to: detect movement of a head in the video; calculate a pulse based on the detected movement; detect irregularities of the pulse; and generate a second prediction result associated with the irregularities of the pulse.
 17. The method of claim 16, wherein calculating the pulse based on the detected movement comprises: identifying a plurality of features of the head; decomposing trajectories of each feature into a set of component motions; determining a component that best corresponds to a heartbeat based on its temporal frequency; identifying peaks associated with the determined component; and calculating the pulse based on the peaks.
 18. The method of claim 10, wherein a third weak classifier is trained to detect irregularities in audio gain of the video.
 19. The method of claim 10, wherein the prediction result is a numerical confidence level.